Find me on Twitter: LudoBenistant



1 Data to insight to decision


1.1 Business understanding

Our example concerns a big company that wants to understand why some of their best and most experienced employees are leaving prematurely. The company also wishes to predict which valuable employees will leave next.




1.2 Analytic solution

We have two goals: first, we want to understand why valuable employees leave, and second, we want to predict who will leave next.

Therefore, we propose to work with the HR department to gather relevant data about the employees and to communicate the significant effects that could explain and predict employees’ departures.




1.3 Assessing Feasibility

Unfortunately, managers didn’t keep an organised record of why people left, but we can still find some explanations in the data set provided by the HR department.

For our 15 000 employees we know: satisfaction level, latest evaluation (yearly), number of projects worked on, average monthly hours, time spent in the company (in years), work accidents (within the past 2 years), promotions within the past 5 years, department and salary.
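The feature list above implies a fixed table schema. Below is a minimal sketch of that schema built on a few synthetic rows (the column names follow the HR export, including its own misspelling `average_montly_hours`; the values shown are illustrative, not the real records):

```r
# Synthetic stand-in for the HR department's table, with the columns
# described above. The real ABT has 15 000 rows.
hr <- data.frame(
  satisfaction_level    = c(0.38, 0.80, 0.11),
  last_evaluation       = c(0.53, 0.86, 0.88),
  number_project        = c(2, 5, 7),
  average_montly_hours  = c(157, 262, 272),  # spelling follows the source data
  time_spend_company    = c(3, 6, 4),        # in years
  Work_accident         = c(0, 0, 0),        # within the past 2 years
  left                  = c(1, 1, 1),        # target: 1 = has left
  promotion_last_5years = c(0, 0, 0),
  sales                 = factor(c("sales", "sales", "sales")),  # department
  salary                = factor(c("low", "medium", "medium"))
)
str(hr)
```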




1.4 Analytical Base Table

This is the database from the HR department. (Note that it doesn’t take into account people who have been fired, transferred or hired in the past year…)

##   satisfaction_level last_evaluation number_project average_montly_hours
## 1               0.38            0.53              2                  157
## 2               0.80            0.86              5                  262
## 3               0.11            0.88              7                  272
## 4               0.72            0.87              5                  223
## 5               0.37            0.52              2                  159
## 6               0.41            0.50              2                  153
##   time_spend_company Work_accident left promotion_last_5years sales salary
## 1                  3             0    1                     0 sales    low
## 2                  6             0    1                     0 sales medium
## 3                  4             0    1                     0 sales medium
## 4                  5             0    1                     0 sales    low
## 5                  3             0    1                     0 sales    low
## 6                  3             0    1                     0 sales    low



2 Data exploration

At this stage we want to understand the data that compose our Analytical Base Table (ABT) and assess where its quality might suffer.

2.1 Data quality report

This table describes the characteristics of each feature of our ABT, through different statistical measures of central tendency and variation. For example, the attrition rate is equal to 24%, the average satisfaction level is around 62% and the average performance evaluation is around 71%. On average, people work on 3 to 4 projects a year and about 200 hours per month.

##  satisfaction_level last_evaluation  number_project  average_montly_hours
##  Min.   :0.0900     Min.   :0.3600   Min.   :2.000   Min.   : 96.0       
##  1st Qu.:0.4400     1st Qu.:0.5600   1st Qu.:3.000   1st Qu.:156.0       
##  Median :0.6400     Median :0.7200   Median :4.000   Median :200.0       
##  Mean   :0.6128     Mean   :0.7161   Mean   :3.803   Mean   :201.1       
##  3rd Qu.:0.8200     3rd Qu.:0.8700   3rd Qu.:5.000   3rd Qu.:245.0       
##  Max.   :1.0000     Max.   :1.0000   Max.   :7.000   Max.   :310.0       
##                                                                          
##  time_spend_company Work_accident         left       
##  Min.   : 2.000     Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 3.000     1st Qu.:0.0000   1st Qu.:0.0000  
##  Median : 3.000     Median :0.0000   Median :0.0000  
##  Mean   : 3.498     Mean   :0.1446   Mean   :0.2381  
##  3rd Qu.: 4.000     3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :10.000     Max.   :1.0000   Max.   :1.0000  
##                                                      
##  promotion_last_5years         sales         salary    
##  Min.   :0.00000       sales      :4140   high  :1237  
##  1st Qu.:0.00000       technical  :2720   low   :7316  
##  Median :0.00000       support    :2229   medium:6446  
##  Mean   :0.02127       IT         :1227                
##  3rd Qu.:0.00000       product_mng: 902                
##  Max.   :1.00000       marketing  : 858                
##                        (Other)    :2923
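The headline figures quoted above are simple column means; a minimal sketch of how they are computed, run here on a few synthetic stand-in rows (the percentages in the text come from the full 15 000-row table):

```r
# Synthetic stand-in for the full hr table; the real figures quoted above
# come from running the same expressions on the complete ABT.
hr <- data.frame(
  left               = c(1, 0, 0, 1, 0),
  satisfaction_level = c(0.38, 0.80, 0.11, 0.72, 0.37),
  last_evaluation    = c(0.53, 0.86, 0.88, 0.87, 0.52)
)
attrition_rate   <- mean(hr$left)                # ~0.24 on the full data
avg_satisfaction <- mean(hr$satisfaction_level)  # ~0.62 on the full data
avg_evaluation   <- mean(hr$last_evaluation)     # ~0.72 on the full data
```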



2.2 First visualisations


2.2.1 Graph

This graph presents the correlations between the variables. The size of each circle reflects the strength of the correlation, while the colour shows its direction (positive or negative).

HR_correlation <- hr %>% select(satisfaction_level:promotion_last_5years)
M <- cor(HR_correlation)
corrplot(M, method="circle")


On average, people who leave have a low satisfaction level, work more hours, and didn’t get promoted within the past five years.


2.2.2 Data

cor(HR_correlation)
##                       satisfaction_level last_evaluation number_project
## satisfaction_level            1.00000000     0.105021214   -0.142969586
## last_evaluation               0.10502121     1.000000000    0.349332589
## number_project               -0.14296959     0.349332589    1.000000000
## average_montly_hours         -0.02004811     0.339741800    0.417210634
## time_spend_company           -0.10086607     0.131590722    0.196785891
## Work_accident                 0.05869724    -0.007104289   -0.004740548
## left                         -0.38837498     0.006567120    0.023787185
## promotion_last_5years         0.02560519    -0.008683768   -0.006063958
##                       average_montly_hours time_spend_company
## satisfaction_level            -0.020048113       -0.100866073
## last_evaluation                0.339741800        0.131590722
## number_project                 0.417210634        0.196785891
## average_montly_hours           1.000000000        0.127754910
## time_spend_company             0.127754910        1.000000000
## Work_accident                 -0.010142888        0.002120418
## left                           0.071287179        0.144822175
## promotion_last_5years         -0.003544414        0.067432925
##                       Work_accident        left promotion_last_5years
## satisfaction_level      0.058697241 -0.38837498           0.025605186
## last_evaluation        -0.007104289  0.00656712          -0.008683768
## number_project         -0.004740548  0.02378719          -0.006063958
## average_montly_hours   -0.010142888  0.07128718          -0.003544414
## time_spend_company      0.002120418  0.14482217           0.067432925
## Work_accident           1.000000000 -0.15462163           0.039245435
## left                   -0.154621634  1.00000000          -0.061788107
## promotion_last_5years   0.039245435 -0.06178811           1.000000000




2.3 Who is leaving?

Let’s create a data frame with only the people who have left the company, so we can visualise the distribution of each feature:

hr_hist <- hr %>% filter(left==1)
par(mfrow=c(1,3))
hist(hr_hist$satisfaction_level,col="#3090C7", main = "Satisfaction level") 
hist(hr_hist$last_evaluation,col="#3090C7", main = "Last evaluation")
hist(hr_hist$average_montly_hours,col="#3090C7", main = "Average montly hours")

We can see why we wouldn’t want to retain everybody: some employees don’t perform well, as their evaluations show, but clearly many good workers leave as well.

par(mfrow=c(1,2))
hist(hr_hist$Work_accident,col="#3090C7", main = "Work accident")
plot(hr_hist$salary,col="#3090C7", main = "Salary")


Out of the 15 000 employees in our database, here is the number who have left:

## [1] 3571

More problematic are the employees who received an evaluation above average, or spent at least four years in the company, or were working on more than five projects at the same time, and who still left the company. These are the people the company should have retained.

hr_good_leaving_people <- hr_hist %>% filter(last_evaluation >= 0.70 | time_spend_company >= 4 | number_project > 5)
nrow(hr_good_leaving_people)
## [1] 2014

2.4 Why do good people leave?

Let’s apply the same criteria as above to the whole data set, keeping only the most valuable employees (whether they left or not), and see why they tend to leave.

2.4.1 Graph

hr_good_leaving_people2 <- hr %>% filter(last_evaluation >= 0.70 | time_spend_company >= 4 | number_project > 5)
hr_good_people_select <- hr_good_leaving_people2 %>% select(satisfaction_level, number_project: promotion_last_5years)
M <- cor(hr_good_people_select)
corrplot(M, method="circle")


Here it’s much clearer. On average, valuable employees who leave are not satisfied, work on many projects, spend many hours in the company each month, and aren’t promoted.




2.4.2 Data

summary(hr_good_leaving_people2)
##  satisfaction_level last_evaluation  number_project  average_montly_hours
##  Min.   :0.090      Min.   :0.3600   Min.   :2.000   Min.   : 96.0       
##  1st Qu.:0.490      1st Qu.:0.7300   1st Qu.:3.000   1st Qu.:171.0       
##  Median :0.680      Median :0.8300   Median :4.000   Median :218.0       
##  Mean   :0.617      Mean   :0.8015   Mean   :4.159   Mean   :211.8       
##  3rd Qu.:0.830      3rd Qu.:0.9100   3rd Qu.:5.000   3rd Qu.:253.0       
##  Max.   :1.000      Max.   :1.0000   Max.   :7.000   Max.   :310.0       
##                                                                          
##  time_spend_company Work_accident         left       
##  Min.   : 2.000     Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 3.000     1st Qu.:0.0000   1st Qu.:0.0000  
##  Median : 4.000     Median :0.0000   Median :0.0000  
##  Mean   : 3.916     Mean   :0.1521   Mean   :0.2061  
##  3rd Qu.: 5.000     3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :10.000     Max.   :1.0000   Max.   :1.0000  
##                                                      
##  promotion_last_5years         sales         salary    
##  Min.   :0.00000       sales      :2628   high  : 834  
##  1st Qu.:0.00000       technical  :1786   low   :4671  
##  Median :0.00000       support    :1466   medium:4267  
##  Mean   :0.02384       IT         : 808                
##  3rd Qu.:0.00000       product_mng: 582                
##  Max.   :1.00000       marketing  : 561                
##                        (Other)    :1941






3 Modeling

Now we want to predict which valuable employees will leave next.

3.1 Select database

Let’s use the same selection as above, keeping only the most valuable employees. Here is a summary of that data set.

hr_model <- hr %>% filter(last_evaluation >= 0.70 | time_spend_company >= 4 | number_project > 5)
summary(hr_model)
##  satisfaction_level last_evaluation  number_project  average_montly_hours
##  Min.   :0.090      Min.   :0.3600   Min.   :2.000   Min.   : 96.0       
##  1st Qu.:0.490      1st Qu.:0.7300   1st Qu.:3.000   1st Qu.:171.0       
##  Median :0.680      Median :0.8300   Median :4.000   Median :218.0       
##  Mean   :0.617      Mean   :0.8015   Mean   :4.159   Mean   :211.8       
##  3rd Qu.:0.830      3rd Qu.:0.9100   3rd Qu.:5.000   3rd Qu.:253.0       
##  Max.   :1.000      Max.   :1.0000   Max.   :7.000   Max.   :310.0       
##                                                                          
##  time_spend_company Work_accident         left       
##  Min.   : 2.000     Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 3.000     1st Qu.:0.0000   1st Qu.:0.0000  
##  Median : 4.000     Median :0.0000   Median :0.0000  
##  Mean   : 3.916     Mean   :0.1521   Mean   :0.2061  
##  3rd Qu.: 5.000     3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :10.000     Max.   :1.0000   Max.   :1.0000  
##                                                      
##  promotion_last_5years         sales         salary    
##  Min.   :0.00000       sales      :2628   high  : 834  
##  1st Qu.:0.00000       technical  :1786   low   :4671  
##  Median :0.00000       support    :1466   medium:4267  
##  Mean   :0.02384       IT         : 808                
##  3rd Qu.:0.00000       product_mng: 582                
##  Max.   :1.00000       marketing  : 561                
##                        (Other)    :1941

3.2 Predictive modeling

After setting up our cross-validation, we build and compare three predictive models: a decision tree, a naive Bayes classifier, and a (boosted) logistic regression.

3.2.1 Cross-Validation

# Set the target variable as a factor
hr_model$left <- as.factor(hr_model$left)
## install.packages("caret") 
library("caret")
## Loading required package: lattice
# cross-validation
# (note: the repeats argument only takes effect with method="repeatedcv")
train_control<- trainControl(method="cv", number=5, repeats=3)
head(train_control)
## $method
## [1] "cv"
## 
## $number
## [1] 5
## 
## $repeats
## [1] 3
## 
## $search
## [1] "grid"
## 
## $p
## [1] 0.75
## 
## $initialWindow
## NULL



3.2.2 Tree learning

library("rpart")
library("rpart.plot")
# train the model 
rpartmodel<- train(left~., data=hr_model, trControl=train_control, method="rpart")
# make predictions
predictions<- predict(rpartmodel,hr_model)
hr_model_tree<- cbind(hr_model,predictions)
# summarize results
confusionMatrix<- confusionMatrix(hr_model_tree$predictions,hr_model_tree$left)
confusionMatrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 7591  276
##          1  167 1738
##                                           
##                Accuracy : 0.9547          
##                  95% CI : (0.9504, 0.9587)
##     No Information Rate : 0.7939          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8586          
##  Mcnemar's Test P-Value : 2.878e-07       
##                                           
##             Sensitivity : 0.9785          
##             Specificity : 0.8630          
##          Pos Pred Value : 0.9649          
##          Neg Pred Value : 0.9123          
##              Prevalence : 0.7939          
##          Detection Rate : 0.7768          
##    Detection Prevalence : 0.8051          
##       Balanced Accuracy : 0.9207          
##                                           
##        'Positive' Class : 0               
## 
# library("ROCR")
# hr_model_tree$predictions <- as.numeric(paste(hr_model_tree$predictions))
# 
# perf.obj <- prediction(predictions=hr_model_tree$predictions, labels=hr_model_tree$left)
# # Get data for ROC curve
# roc.obj <- performance(perf.obj, measure="tpr", x.measure="fpr")
# plot(roc.obj,
#      main="Cross-Sell - ROC Curves",
#      xlab="1 – Specificity: False Positive Rate",
#      ylab="Sensitivity: True Positive Rate",
#      col="blue")
# abline(0,1,col="grey")



3.2.3 Naive Bayes

## Loading required package: kknn
## 
## Attaching package: 'kknn'
## The following object is masked from 'package:caret':
## 
##     contr.dummy
# train the model 
e1071model2 <- train(left~., data=hr_model, trControl=train_control, method="nb")
## Loading required package: klaR
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
# make predictions
predictions<- predict(e1071model2,hr_model)
e1071modelbinded <- cbind(hr_model,predictions)
# summarize results
confusionMatrix<- confusionMatrix(e1071modelbinded$predictions,e1071modelbinded$left)
confusionMatrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 7663  601
##          1   95 1413
##                                           
##                Accuracy : 0.9288          
##                  95% CI : (0.9235, 0.9338)
##     No Information Rate : 0.7939          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.76            
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9878          
##             Specificity : 0.7016          
##          Pos Pred Value : 0.9273          
##          Neg Pred Value : 0.9370          
##              Prevalence : 0.7939          
##          Detection Rate : 0.7842          
##    Detection Prevalence : 0.8457          
##       Balanced Accuracy : 0.8447          
##                                           
##        'Positive' Class : 0               
## 



3.2.4 Logistic regression

# train the model 
gmlmodel <- train(left~., data=hr_model, trControl=train_control, method="LogitBoost")
## Loading required package: caTools
# make predictions
predictions<- predict(gmlmodel,hr_model)
gmlmodelbinded <- cbind(hr_model,predictions)
# summarize results
confusionMatrix<- confusionMatrix(gmlmodelbinded$predictions,gmlmodelbinded$left)
confusionMatrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 7671  319
##          1   87 1695
##                                           
##                Accuracy : 0.9585          
##                  95% CI : (0.9543, 0.9623)
##     No Information Rate : 0.7939          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8674          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9888          
##             Specificity : 0.8416          
##          Pos Pred Value : 0.9601          
##          Neg Pred Value : 0.9512          
##              Prevalence : 0.7939          
##          Detection Rate : 0.7850          
##    Detection Prevalence : 0.8176          
##       Balanced Accuracy : 0.9152          
##                                           
##        'Positive' Class : 0               
## 
# library("ROCR")
# gmlmodelbinded$predictions <- as.numeric(paste(gmlmodelbinded$predictions))
# 
# perf.obj <- prediction(predictions=gmlmodelbinded$predictions, labels=gmlmodelbinded$left)
# # Get data for ROC curve
# roc.obj <- performance(perf.obj, measure="tpr", x.measure="fpr")
# plot(roc.obj,
#      main="Cross-Sell - ROC Curves",
#      xlab="1 – Specificity: False Positive Rate",
#      ylab="Sensitivity: True Positive Rate",
#      col="blue")
# abline(0,1,col="grey")



4 Actionable insights

The confusion matrices and accuracy figures of the different models show that their predictive power is very similar and seems robust: around 95% accuracy, with Kappa values between 0.76 and 0.87. We decide to keep the logistic regression model to lay out actionable insights: it is a simple model and gives the best results.
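The figures behind that decision can be tabulated side by side. The accuracy and Kappa values below are transcribed from the confusion-matrix output of the three models above:

```r
# Accuracy and Kappa copied from the confusion matrices shown earlier.
model_comparison <- data.frame(
  model    = c("rpart (tree)", "naive Bayes", "LogitBoost"),
  accuracy = c(0.9547, 0.9288, 0.9585),
  kappa    = c(0.8586, 0.7600, 0.8674),
  stringsAsFactors = FALSE
)
# The model with the highest accuracy (also the highest Kappa here)
best <- model_comparison$model[which.max(model_comparison$accuracy)]
best
```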

Here is a plot showing the employees’ probability of leaving against their performance. We need to focus on the top right. To do that, we build a data table ranking each employee by the probability of leaving found by the logistic regression model, multiplied by their performance; this gives the company’s retention priority.

set.seed(100)
# Keep some data to test again the final model
inTraining <- createDataPartition(hr_model$left, p = .75, list = FALSE)
training <- hr_model[ inTraining,]
testing  <- hr_model[-inTraining,]
# Estimate the drivers of attrition
logreg = glm(left ~ ., family=binomial(logit), data=training)
# Make predictions on the out-of-sample data
probaToLeave=predict(logreg,newdata=testing,type="response")
# Structure the prediction output in a table
predattrition = data.frame(probaToLeave)
# Add a column to the predattrition dataframe containing the performance
predattrition$performance=testing$last_evaluation
plot(predattrition$probaToLeave,predattrition$performance)

Here we display the first 300 employees the company should try to retain. After grouping them by department, we could email the managers to tell them which valuable employees might leave soon.

predattrition$priority=predattrition$performance*predattrition$probaToLeave
orderpredattrition=predattrition[order(predattrition$priority,decreasing = TRUE),]
orderpredattrition <- head(orderpredattrition, n=300)
datatable(orderpredattrition)
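A minimal sketch of the per-department grouping suggested above. Note that `predattrition` as built earlier carries no department column, so the real code would first copy it over from the hold-out set (the `sales` column of `testing`); the rows below are synthetic stand-ins:

```r
# Synthetic stand-in: in the real analysis the department column would come
# from testing$sales before splitting.
predattrition <- data.frame(
  probaToLeave = c(0.91, 0.85, 0.80, 0.40),
  performance  = c(0.95, 0.90, 0.88, 0.70),
  department   = c("technical", "sales", "technical", "support"),
  stringsAsFactors = FALSE
)
predattrition$priority <- predattrition$performance * predattrition$probaToLeave
# One retention list per department, sorted by priority, ready to send
# to the relevant manager.
ordered <- predattrition[order(-predattrition$priority), ]
by_department <- split(ordered, ordered$department)
names(by_department)
```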



Last updated: November 2015